HIGH-PERFORMANCE LOCK MANAGEMENT FOR FLASH COPY IN N-WAY
SHARED STORAGE SYSTEMS



Field of the Invention 

The present invention relates to the field of computer storage 
controllers, and particularly to advanced function storage controllers in 
n-way shared storage systems providing a Flash Copy function. 

Background of the Invention 

In the field of computer storage systems, there is increasing demand 
for what have come to be described as "advanced functions". Such 
functions go beyond the simple I/O functions of conventional storage
controller systems. Advanced functions are well known in the art and 
depend on the control of metadata used to retain state data about the real 
or "user" data stored in the system. The manipulations available using 
advanced functions enable various actions to be applied quickly to virtual 
images of data, while leaving the real data available for use by user 
applications. One such well-known advanced function is 'Flash Copy'.

At the highest level, Flash Copy is a function where a second image 
of 'some data' is made available. This function is sometimes known as 
Point-In-Time copy, or T0-copy. The second image's contents are initially
identical to those of the first. The second image is made available
'instantly'. In practical terms this means that the second image is made
available in much less time than would be required to create a true,
separate, physical copy, and that it can be established
without unacceptable disruption to a using application's operation.

Once established, the second copy can be used for a number of 
purposes including performing backups, system trials and data mining. The 
first copy continues to be used for its original purpose by the original 
using application. Contrast this with backup without Flash Copy, where the
application must be shut down and the backup taken before the application
can be restarted. It is becoming increasingly difficult to find time
windows where an application is sufficiently idle to be shut down. The 
cost of taking a backup is increasing. There is significant and increasing 
business value in the ability of Flash Copy to allow backups to be taken 
without stopping the business. 

Flash Copy implementations achieve the illusion of the existence of a
second image by redirecting read I/O addressed to the second image
(henceforth Target) to the original image (henceforth Source), unless that
region has been subject to a write. Where a region has been the subject of
a write (to either Source or Target), then to maintain the illusion that
both Source and Target own their own copy of the data, a process is invoked 
which suspends the operation of the write command, and without it having 
taken effect, issues a read of the affected region from the Source, applies 
the read data to the Target with a write, then (and only if all steps were 
successful) releases the suspended write. Subsequent writes to the same 
region do not need to be suspended since the Target will already have its 
own copy of the data. This copy-on-write technique is well known and is 
used in many environments. 
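
By way of illustration only, the copy-on-write behaviour described above may be sketched as follows. The class name FlashCopyPair, the constant GRAIN_SIZE, the use of in-memory byte arrays in place of disks, and the restriction to whole-region writes are all assumptions made for this sketch and form no part of the original description.

```python
GRAIN_SIZE = 4  # bytes per region; deliberately tiny for illustration


class FlashCopyPair:
    """Hypothetical Source/Target pair with one 'copied' flag per region."""

    def __init__(self, source: bytearray):
        if len(source) % GRAIN_SIZE:
            raise ValueError("source length must be a whole number of regions")
        self.source = source
        self.target = bytearray(len(source))  # holds no real data at first
        self.copied = [False] * (len(source) // GRAIN_SIZE)

    def _span(self, grain: int) -> slice:
        return slice(grain * GRAIN_SIZE, (grain + 1) * GRAIN_SIZE)

    def read_source(self, grain: int) -> bytes:
        return bytes(self.source[self._span(grain)])

    def read_target(self, grain: int) -> bytes:
        # Target reads are redirected to the Source until the region is copied.
        disk = self.target if self.copied[grain] else self.source
        return bytes(disk[self._span(grain)])

    def write(self, disk: str, grain: int, data: bytes) -> None:
        if disk not in ("source", "target"):
            raise ValueError("disk must be 'source' or 'target'")
        if len(data) != GRAIN_SIZE:
            raise ValueError("this sketch only handles whole-region writes")
        span = self._span(grain)
        if not self.copied[grain]:
            # The incoming write is held back while the region is read from
            # the Source and written to the Target.
            self.target[span] = self.source[span]
            self.copied[grain] = True
        # Only now is the suspended write released.
        dest = self.source if disk == "source" else self.target
        dest[span] = data
```

After a write to region 0 of the Source, a read of region 0 of the Target still returns the data as it stood when the pair was created, which is the illusion the text describes.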

All implementations of Flash Copy rely on a data structure which
governs the decisions discussed above, namely, the decision as to whether
reads received at the Target are issued to the Source or the Target, and
the decision as to whether a write must be suspended to allow the
copy-on-write to take place. The data structure essentially tracks the 
regions or grains of data that have been copied from source to target, as 
distinct from those that have not. 

Maintenance of this data structure (hereinafter called metadata) is 
key to implementing the algorithm behind Flash Copy. 
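
One possible form of such metadata, offered only as a sketch, is a bitmap holding one bit per grain. The class GrainMetadata, its method names, and the choice of a Python integer as the bitmap are hypothetical; the description does not prescribe any particular representation.

```python
class GrainMetadata:
    """Hypothetical per-grain 'copied' bitmap for one Source/Target pair."""

    def __init__(self, volume_bytes: int, grain_bytes: int):
        self.grain_bytes = grain_bytes
        self.n_grains = -(-volume_bytes // grain_bytes)  # ceiling division
        self._bits = 0  # bit g is set once grain g has been copied

    def grains_for(self, offset: int, length: int) -> range:
        """Grains touched by an I/O of `length` bytes starting at `offset`."""
        first = offset // self.grain_bytes
        last = (offset + length - 1) // self.grain_bytes
        return range(first, last + 1)

    def is_copied(self, grain: int) -> bool:
        return bool((self._bits >> grain) & 1)

    def mark_copied(self, grain: int) -> None:
        self._bits |= 1 << grain

    def target_read_goes_to(self, grain: int) -> str:
        # First decision: where a read received at the Target is issued.
        return "target" if self.is_copied(grain) else "source"

    def grains_needing_copy(self, offset: int, length: int) -> list:
        # Second decision: a write must be suspended for each grain listed.
        return [g for g in self.grains_for(offset, length)
                if not self.is_copied(g)]
```

For example, with 64 KiB grains a two-byte write starting at offset 65535 straddles grains 0 and 1, so both grains must be consulted before the write may proceed.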

Flash Copy is relatively straightforward to implement within a single 
CPU complex (possibly with SMP processors), as is often employed within
modern storage controllers. With a little more effort, it is possible to 
implement fault tolerant Flash Copy, such that (at least) two CPU complexes 
have access to a copy of the metadata. In the event of a failure of the 
first CPU complex, the second can be used to continue operation, without 
loss of access to the Target Image. 

However, the I/O capability of a single CPU complex is limited.
Though it can be improved, the capability of a single CPU complex, measured
in terms of either I/Os per second or bandwidth (MB/s), has a finite limit,
and will eventually impose a constraint on the performance of the using
applications. This limit arises in many implementations of Flash Copy, but
a good example is in storage controllers. A typical storage controller has
a single CPU complex (or possibly a redundant pair), which dictates a
limit in the performance capability of that controller.

More storage controllers can be installed. But the separate storage 
controllers do not share access to the metadata, and therefore do not 
cooperate in managing a Flash Copy image. The storage space becomes
fragmented, with Flash Copy being confined to the scope of a single
controller system. Both Source and Target disks must be managed within the
same storage controller. A single storage controller disk space might
become full, while another has some spare space, but it is not possible to
separate the Source and Target disks, placing the Target disk under the
control of the new controller. (This is particularly unfortunate in the
case of a new Flash Copy, where moving the Target is a cheap operation, as
it has no physical data associated with it.)

As well as constraining the total performance possible for a 
Source/Target pair, the constraint of single-controller storage functions 
adds complexity to the administration of the storage environment. 

Typically, storage control systems today do not attempt to solve this 
problem. They implement Flash Copy techniques that are confined to a
single controller, and hence are constrained by the capability of that 
controller. 

A simple way of allowing multiple controllers to participate in a 
shared Flash Copy relationship is to assign one controller as the Owner of 
the metadata, and have the other controllers forward all read and write 
requests to that controller. The owning controller processes the I/O 
requests as if they came directly from its own attached host servers, using 
the algorithm described above, and completes each I/O request back to the 
originating controller. 

The main drawback of such a system, and the reason that it is not 
widely used, is that the burden of forwarding each I/O request is too 
great, possibly even doubling the total system-wide cost, and hence
approximately halving the system performance. 

It is known, for example, in the area of distributed parallel 
database systems, to have a distributed lock management structure employing 
a two-phase locking protocol to hold locks on data in order to maintain any 
copies of the data in a coherency relation. However, two-phase locking is
typically time-consuming and adds a considerable messaging burden to the
processing. As such, an unmodified two-phase locking protocol according to
the prior art is disadvantageous in systems at a lower level in the
software and hardware stack, such as storage area networks having
distributed storage controllers where the performance impact of the passing
of locking control messages is even more significant than it is at the
database control level. 



It would therefore be desirable to gain the advantages of distributed 
lock management in a Flash Copy environment while incurring the minimum 
lock messaging overhead.

Summary of the Invention

The present invention accordingly provides, in a first aspect, a 
storage control apparatus operable in a network of storage controller nodes 
having an owner storage controller node operable to control locking of a 
region of data during I/O activity, a messaging component operable to
pass at least one message to request a lock, grant a lock, request release
of a lock, and signal that a lock has been released; comprising an I/O
performing component operable to perform I/O on data owned by any said 
owner storage controller node, subject to said I/O performing component's 
compliance with lock protocols controlled by said owner storage controller 
node; wherein any Flash Copy image of said region of data is maintained in 
a coherency relation with said region of data; and wherein said I/O 
performing component is operable to cache a previous positive confirmation 
that said region of data has been left in a coherency relation with any 
said Flash Copy image, and to perform I/O activity on the basis of said 
previous positive confirmation. 

Preferably, said I/O performing component is operable to discard a 
cached positive confirmation and to subsequently request a lock afresh. 

Preferably, a reduced cache storage area is managed by selectively 
discarding a cached positive confirmation. 

Preferably, a positive confirmation for said region of data further 
comprises any positive confirmation for a further region of data that is 
contiguous with said region of data. 

The present invention provides, in a second aspect, a method of 
storage control in a network of storage controller nodes having an owner 
storage controller node operable to control locking of a region of data 
during I/O activity, a messaging component operable to pass at least one 
message to request a lock, grant a lock, request release of a lock, and 
signal that a lock has been released, the method comprising the steps of 
performing I/O on data owned by any said owner storage controller node,
subject to compliance with lock protocols controlled by said owner storage 
controller node; maintaining Flash Copy of said region of data in a 
coherency relation with said region of data; and caching a previous 
positive confirmation that said region of data has been left in a coherency 
relation with any said Flash Copy image, and performing I/O activity on the 
basis of said previous positive confirmation. 

Preferably, the method of the second aspect further comprises the
step of discarding a cached positive confirmation and subsequently
requesting a lock afresh.

Preferably, the method of the second aspect further comprises 
managing a reduced cache storage area by selectively discarding a cached 
positive confirmation. 

Preferably, a positive confirmation for said region of data further 
comprises any positive confirmation for a further region of data that is 
contiguous with said region of data. 

In a third aspect, the present invention provides a computer program 
product tangibly embodied in a computer-readable storage medium, comprising
computer program code means to, when loaded into a computer system and 
executed thereon, cause storage control apparatus in a network of storage 
controller nodes having an owner storage controller node operable to 
control locking of a region of data during I/O activity, a messaging 
component operable to pass at least one message to request a lock, grant a 
lock, request release of a lock, and signal that a lock has been released,
to perform the steps of performing I/O on data owned by any said owner
storage controller node, subject to compliance with lock protocols
controlled by said owner storage controller node; maintaining Flash Copy of
said region of data in a coherency relation with said region of data; and
caching a previous positive confirmation that said region of data has been 
left in a coherency relation with any said Flash Copy image, and performing 
I/O activity on the basis of said previous positive confirmation. 

The presently preferred embodiment of the present invention employs a 
modified two-phase lock messaging scheme to co-ordinate activity among
plural storage controllers (or nodes) in an n-way system. The messaging 
co-ordinates activity between the nodes in the system, but each node is 
still responsible for performing its own I/O. Each of the nodes is 
provided with a caching facility to cache the results of its previous lock 
requests, so that it can reuse certain of the results to avoid the 
necessity of querying the status of a region of data if the status is 
already positively indicated by the cached previous results. 

Brief Description of the Drawings 

A preferred embodiment of the present invention will now be described 
by way of example only, with reference to the accompanying drawings, in 
which: 

Figure 1 is a flow diagram illustrating one embodiment of a two-phase
locking scheme using lock messages to control coherency between a region of
data and a Flash Copy image of the data;

Figure 2 shows the system components of a system according to the 
preferred embodiment of the present invention; and 

Figure 3 shows the additional steps of the preferred embodiment of 
the present invention. 

Detailed Description of the Preferred Embodiment 

For better understanding of the presently preferred embodiment of the 
present invention, it is necessary to describe the use of two-phase lock 
messaging to co-ordinate activity between plural storage controllers (or 
nodes) in an n-way storage system. 

As an example, consider an n-way system implementing Flash Copy. 
Assume every node has access to the storage managed by the co-operating set 
of n nodes. One of the nodes is designated as an owner (102) for metadata 
relating to all I/O relationships of a region of storage. The other nodes 
are designated as clients. In a presently most preferred embodiment, one 
of the client nodes is further designated as a backup owner and maintains a 
copy of the metadata in order to provide continuous availability in the 
event of a failure of the owner node. 

Consider a host I/O request arriving (104) at a particular client 
node (C). Suppose that the host I/O request is either a Read or Write of
the Target disk, or possibly a Write of the Source disk. Client C begins 
processing by suspending (106) the I/O. C then sends (108) a message REQ 
to the Owner node O, asking if the grain has been copied. 

On receipt of message REQ, O inspects its own metadata structures. 
If O finds that the region has already been copied, O replies (110) with a
NACK message. If O finds that the region has not already been copied, it 
places a lock record against the appropriate metadata for the region within 
its own metadata structures, and replies (112) with a GNT message. The 
lock record is required to ensure compatibility between the request just 
received and granted, and further requests that might arrive affecting the 
same metadata while the processing at C continues. The details of how the 
lock record is maintained, and what the compatibility constraints are, are 
the same as if the I/O had been received locally by O, and are well known 
to those skilled in the art. 
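
The Owner's handling of a REQ message may be sketched as below, purely for illustration. The Owner class and its fields are hypothetical, and the compatibility rule is simplified to a single exclusive lock per grain with later requesters queued; the description leaves the actual compatibility constraints to known practice.

```python
from dataclasses import dataclass, field


@dataclass
class Owner:
    """Hypothetical Owner-side metadata for one Flash Copy relationship."""
    copied: set = field(default_factory=set)     # grains already copied
    locks: dict = field(default_factory=dict)    # grain -> client holding lock
    waiting: dict = field(default_factory=dict)  # grain -> queued clients

    def on_req(self, client: str, grain: int):
        """Return 'NACK', 'GNT', or None when the reply must be deferred."""
        if grain in self.copied:
            return "NACK"                        # step 110: already copied
        if grain in self.locks:
            # Assumed rule: an incompatible request waits for the lock holder.
            self.waiting.setdefault(grain, []).append(client)
            return None
        self.locks[grain] = client               # place the lock record
        return "GNT"                             # step 112
```

Under this simplification a second client asking about a locked grain receives no immediate reply; it is answered only once the first client releases the lock.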

On receipt of a NACK message, C unpends (114) the original I/O
request.

On receipt of the GNT message, C continues (116) by performing the
data transfer or transfers required by the Flash Copy algorithm. In the
case of a Target Read, this means performing the read to the Source disk.
Some time later, C will process completion (118) for the read request, and
will issue (120) an UNL message to O, at the same time as completing the
original I/O request to the host system that issued it.

O, on receipt of an UNL message, removes (122) the lock record from
its metadata table, thus possibly releasing further I/O requests that were
suspended because of that lock. In the presently most preferred
embodiment, O then delivers (124) a UNLD message to C, allowing C to reuse
the resources associated with the original request. This is, however, not 
required by the Flash Copy algorithm itself. 

In the case of a write (to either Target or Source) C must perform 
the copy-on-write (127). Having completed all steps of the copy-on-write,
and with the original write I/O request still suspended, C issues (126) an
UNLC request to O. 

O, on receipt of an UNLC message, marks (128) in the metadata the
region affected as having been copied, removes (130) the lock record, 
informs (132) any waiting requests that the area has now been copied, and 
then issues (134) an UNLD message to C. 

C, on receipt of a UNLD message, releases (136) the suspended write 
operation, which will some time later complete, and then C completes (138) 
the write operation to the host. 
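
The two release paths at the Owner, UNL after a Target read and UNLC after a copy-on-write, may be sketched as follows. The dictionary layout and function names are assumptions made for illustration, as is the policy of granting the lock to the first waiter after an UNL; the step numbers in the comments refer to Figure 1.

```python
def new_owner_state() -> dict:
    """Hypothetical Owner metadata: copied grains, lock holders, waiters."""
    return {"copied": set(), "locks": {}, "waiting": {}}


def _require_holder(state: dict, client: str, grain: int) -> None:
    if state["locks"].get(grain) != client:
        raise ValueError("client does not hold the lock for this grain")


def handle_unl(state: dict, client: str, grain: int) -> list:
    """Read finished: drop the lock; the grain remains uncopied."""
    _require_holder(state, client, grain)
    del state["locks"][grain]                          # step 122
    waiters = state["waiting"].pop(grain, [])
    replies = []
    if waiters:
        # Assumed policy: the first waiter now receives the lock.
        next_client, rest = waiters[0], waiters[1:]
        state["locks"][grain] = next_client
        if rest:
            state["waiting"][grain] = rest
        replies.append((next_client, "GNT"))
    replies.append((client, "UNLD"))                   # step 124
    return replies


def handle_unlc(state: dict, client: str, grain: int) -> list:
    """Copy-on-write finished: mark copied, drop the lock, wake waiters."""
    _require_holder(state, client, grain)
    state["copied"].add(grain)                         # step 128
    del state["locks"][grain]                          # step 130
    waiters = state["waiting"].pop(grain, [])
    replies = [(w, "NACK") for w in waiters]           # step 132
    replies.append((client, "UNLD"))                   # step 134
    return replies
```

Each function returns the list of (recipient, message) pairs the Owner would send, so the difference between the two paths is visible directly: waiters are granted the lock after an UNL, but are told the grain is already copied after an UNLC.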

Recovery paths are required to cater for the situation where a disk 
I/O fails, or the messaging system fails, or a node fails, but the 
requirements and implementations of these are well understood in the art.

The above description was from the point of view of a single I/O, and 
a single Client C. But it is clear how the scheme continues to operate in 
the presence of multiple I/Os, from multiple client nodes, with O
continuing to process all requests by the same algorithm. 

Turning now to Figure 2, there is shown an apparatus in 
accordance with a presently preferred embodiment of the invention, the 
apparatus embodied in a storage controller network comprising an Owner 
(202), a Client (204) I/O performing component, a portion of metadata (206)
relating to data (208) held under the control of the storage network, a
copy (209) of the data (208), and communication means. The apparatus
includes an ownership assignment component (210) to assign ownership of
metadata to the Owner (202), and a lock management component (212) operable
to control locking at said metadata (206) level during I/O activity to
ensure coherency with any copy (209). Included also is a messaging
component (214) at the Owner (202), the messaging component (214) being 
operable to pass one or more messages between Client (204) and Owner (202) 
to request a response regarding a metadata state, grant a lock, request 
release of a lock, and signal that a lock has been released. The Client 
(204) is operable to perform I/O on data whose metadata is owned by any 
Owner (202), subject to the Client's (204) compliance with the lock 
protocols at the metadata level controlled by said Owner (202).

The system and method thus described are capable of handling 
distributed lock management in an n-way shared storage controller network, 
but have the disadvantage that there is considerable messaging overhead in 
the system. This is not burdensome in systems containing relatively few 
controllers or where there is relatively little activity, but in modern 
storage systems, such as very large storage area networks, there are likely 
to be many controllers and a very high level of storage activity. In such 
circumstances, the avoidance of unnecessary messaging overhead would be
advantageous.

Thus, to improve the processing performance of the system, in the 
most preferred embodiment of the present invention, each client node is 
provided with the ability to maintain information which records the last 
response received from an Owner. Specifically (described in terms of the 
additions to Figure 1 according to Figure 3), the client node C is 
permitted to cache (308) the information that it received a NACK after step
114 of Figure 1, or that itself issued and had acknowledged an UNLC/UNLD 
pair at step 126 and after step 134 of Figure 1. 

On receipt (302) of a host I/O request as at step 104 of Figure 1, 
Client C now applies a modified lock control algorithm, as follows. 

C first inspects its cached data (303), to see if it has a positive 
indication that the region affected has already been copied. If it has, 
then it continues (304) with the I/O without sending any protocol messages
to O. 



If the cache contains no such positive indication, the unmodified 
protocol described above is used. Client C proceeds with step 106 and 
following of Figure 1. The receipt (306) of a NACK or an UNLC/UNLD pair
causes the cache of information to be updated (308), and subsequent I/Os
that affect that region, finding this information in the cache at (303),
can proceed (304) without issuing any protocol messages.
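
The modified client algorithm may be sketched as follows, again only as an illustration. The Client class, the make_owner stub (which answers synchronously and omits lock records altogether) and the callback parameters are hypothetical; the step numbers in the comments refer to Figures 1 and 3.

```python
def make_owner():
    """Stub Owner that answers synchronously; lock records are omitted."""
    copied = set()

    def send(message: str, grain: int) -> str:
        if message == "REQ":
            return "NACK" if grain in copied else "GNT"
        if message == "UNLC":
            copied.add(grain)
            return "UNLD"
        raise ValueError(f"unexpected message {message!r}")

    return send


class Client:
    def __init__(self, send):
        self.send = send
        self.known_copied = set()   # positive indications only
        self.messages_sent = 0

    def _ask(self, message: str, grain: int) -> str:
        self.messages_sent += 1
        return self.send(message, grain)

    def write(self, grain: int, copy_on_write, do_write) -> None:
        if grain not in self.known_copied:            # step 303: cache miss
            reply = self._ask("REQ", grain)           # step 108
            if reply == "GNT":
                copy_on_write(grain)                  # step 127
                if self._ask("UNLC", grain) != "UNLD":
                    raise RuntimeError("expected UNLD after UNLC")
            elif reply != "NACK":
                raise RuntimeError(f"unexpected reply {reply!r}")
            self.known_copied.add(grain)              # step 308
        do_write(grain)                               # step 304
```

In this sketch the first write to a grain costs the client two messages (REQ and UNLC), a client that finds the grain already copied sends one (REQ, answered by NACK), and every later write to that grain from the same client sends none.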

The term 'pessimistic cache' is sometimes used to describe the
approach needed in the presently most preferred embodiment of the present 
invention. This means that the client need not be fully up-to-date with 
the Owner's metadata; the client may believe an area needs to be copied, 
and be corrected by the owner (NACK) to say it does not. However, the
client must never believe that an area has been copied when the owner knows 
it has not. 

The lock caching of the presently preferred embodiment requires 
certain changes to the client for correct operation. First, the cache must 
be initialised (301) (to indicate that all regions must be copied) each 
time a Flash Copy relationship is started (300a). This might be driven in
a number of ways, but a message from Owner to Client is the most
straightforward to implement. Second, any time a client node might have
missed a message (300b) indicating that the cache has been reinitialised
(perhaps because of a network disturbance), the client must assume the
worst case and reinitialise (301) or revalidate its cache. 
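
The two reinitialisation triggers, and the invariant that the client's cache may only understate what the Owner knows, may be sketched as below. The ClientCache class and the is_pessimistic check are hypothetical; the check exists only to make the invariant testable and would not be available to a real client, which cannot see the Owner's metadata.

```python
class ClientCache:
    """Hypothetical client-side cache of grains known to be copied."""

    def __init__(self):
        self._known_copied = set()   # empty means 'all regions must be copied'

    def reinitialise(self) -> None:                   # step 301
        self._known_copied.clear()

    def on_relationship_started(self) -> None:        # step 300a
        self.reinitialise()

    def on_possible_missed_message(self) -> None:     # step 300b
        # Worst case assumed: a reinitialise notice may have been lost.
        self.reinitialise()

    def record_copied(self, grain: int) -> None:
        self._known_copied.add(grain)

    def is_known_copied(self, grain: int) -> bool:
        return grain in self._known_copied

    def is_pessimistic(self, owner_copied: set) -> bool:
        # The cache must never claim a grain is copied when the Owner says not.
        return self._known_copied <= set(owner_copied)
```

A cache left over from an earlier relationship would fail this check as soon as a new relationship starts, which is why the cache is cleared at steps 300a and 300b.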

Further extensions and variations are possible, as will be clear to 
one skilled in the art. For example, the cached information is discardable,
as it can always be recovered from the owner node, which has the only truly 
up-to-date copy. Thus, the client could have less metadata space allocated 
for caching information than would be required to store all the metadata 
held on all the nodes. The clients could then rely on locality of access 
for the I/Os they process to ensure that they continue to benefit from the 
caching of the lock message information. 
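
A reduced cache of this kind might, for example, discard the least recently used entry when full. The sketch below assumes such a policy; the class BoundedCopiedCache and the choice of least-recently-used eviction are illustrative only, and any discard policy would be safe because a discarded entry merely causes a fresh REQ to the Owner.

```python
from collections import OrderedDict


class BoundedCopiedCache:
    """Hypothetical fixed-capacity cache of grains known to be copied."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()   # oldest entry first

    def record_copied(self, grain: int) -> None:
        self._entries[grain] = True
        self._entries.move_to_end(grain)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # discard least recently used

    def is_known_copied(self, grain: int) -> bool:
        if grain in self._entries:
            self._entries.move_to_end(grain)    # a hit refreshes the entry
            return True
        return False                            # caller falls back to REQ
```

With a capacity of two, recording grains 1 and 2, touching grain 1, and then recording grain 3 discards grain 2, so only a later I/O to grain 2 needs to consult the Owner again.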

In a further extended embodiment, a NACK message (and also the GNT or 
UNLD messages) can carry back more information than that relating to the
region being directly processed by the REQ/GNT/UNLC/UNLD messages.
Information concerning neighbouring regions that have also been cleaned can 
be sent from owners to clients. 
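
One way such additional information might be encoded, given here only as a sketch, is as the bounds of the contiguous run of copied regions surrounding the requested one. The functions copied_run and apply_run and the max_extra limit are hypothetical and are not taken from the description.

```python
def copied_run(copied: set, grain: int, n_grains: int, max_extra: int = 8):
    """Owner side: bounds (first, last) of the copied run around `grain`.

    At most `max_extra` neighbours are reported in each direction.
    """
    if grain not in copied:
        raise ValueError("only a copied grain can be reported")
    first = last = grain
    while first > 0 and (first - 1) in copied and grain - first < max_extra:
        first -= 1
    while last + 1 < n_grains and (last + 1) in copied and last - grain < max_extra:
        last += 1
    return first, last


def apply_run(known_copied: set, first: int, last: int) -> None:
    """Client side: record every grain in the reported run as copied."""
    known_copied.update(range(first, last + 1))
```

A client that receives the run (2, 5) in reply to a request about grain 4 can then serve I/O to grains 2, 3 and 5 without sending any further messages.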

It will be appreciated that the method described above will typically 
be carried out in software running on one or more processors (not shown),
and that the software may be provided as a computer program element carried 
on any suitable data carrier (also not shown) such as a magnetic or optical 
computer disc. The channels for the transmission of data likewise may 
include storage media of all descriptions as well as signal carrying media, 
such as wired or wireless signal media. 

The present invention may suitably be embodied as a computer program
product for use with a computer system. Such an implementation may 
comprise a series of computer readable instructions either fixed on a 
tangible medium, such as a computer readable medium, for example, diskette, 
CD-ROM, ROM, or hard disk, or transmittable to a computer system, via a 
modem or other interface device, over either a tangible medium, including 
but not limited to optical or analogue communications lines, or intangibly 
using wireless techniques, including but not limited to microwave, infrared 
or other transmission techniques. The series of computer readable
instructions embodies all or part of the functionality previously described 
herein. 

Those skilled in the art will appreciate that such computer readable 
instructions can be written in a number of programming languages for use 
with many computer architectures or operating systems. Further, such 
instructions may be stored using any memory technology, present or future, 
including but not limited to, semiconductor, magnetic, or optical, or 
transmitted using any communications technology, present or future, 
including but not limited to optical, infrared, or microwave. It is 
contemplated that such a computer program product may be distributed as a 
removable medium with accompanying printed or electronic documentation, for 
example, shrink-wrapped software, pre-loaded with a computer system, for 
example, on a system ROM or fixed disk, or distributed from a server or 
electronic bulletin board over a network, for example, the Internet or 
World Wide Web. 

It will be appreciated that various modifications to the embodiment 
described above will be apparent to a person of ordinary skill in the art. 

CLAIMS

1. A storage control apparatus operable in a network of storage 
controller nodes having an owner storage controller node operable to 
control locking of a region of data during I/O activity, a messaging 
component operable to pass at least one message to request a lock, grant a 
lock, request release of a lock, and signal that a lock has been released; 
comprising: 

an I/O performing component operable to perform I/O on data owned by 
any said owner storage controller node, subject to said I/O performing 
component's compliance with lock protocols controlled by said owner storage 
controller node; 

wherein any Flash Copy image of said region of data is maintained in
a coherency relation with said region of data; and 

wherein said I/O performing component is operable to cache a previous
positive confirmation that said region of data has been left in a coherency 
relation with any said Flash Copy image, and to perform I/O activity on the 
basis of said previous positive confirmation. 

2. An apparatus as claimed in claim 1, wherein said I/O performing 
component is operable to discard a cached positive confirmation and to 
subsequently request a lock afresh. 

3. An apparatus as claimed in claim 2, wherein a reduced cache storage
area is managed by selectively discarding a cached positive confirmation. 

4. An apparatus as claimed in claim 1, wherein a positive confirmation 
for said region of data further comprises any positive confirmation for a 
further region of data that is contiguous with said region of data. 

5. A method of storage control in a network of storage controller nodes 
having an owner storage controller node operable to control locking of a
region of data during I/O activity, a messaging component operable to pass 
at least one message to request a lock, grant a lock, request release of a 
lock, and signal that a lock has been released, the method comprising the 
steps of: 

performing I/O on data owned by any said owner storage controller 
node, subject to compliance with lock protocols controlled by said owner 
storage controller node; 

maintaining Flash Copy of said region of data in a coherency relation
with said region of data; and 

caching a previous positive confirmation that said region of data has 
been left in a coherency relation with any said Flash Copy image, and 
performing I/O activity on the basis of said previous positive 
confirmation. 

6. A method as claimed in claim 5, further comprising the step of 
discarding a cached positive confirmation and subsequently requesting a 
lock afresh.

7. A method as claimed in claim 6, further comprising managing a reduced 
cache storage area by selectively discarding a cached positive 
confirmation.

8. A method as claimed in claim 5, wherein a positive confirmation for 
said region of data further comprises any positive confirmation for a 
further region of data that is contiguous with said region of data.

9. A computer program product tangibly embodied in a computer-readable
storage medium, comprising computer program code means to, when loaded into 
a computer system and executed thereon, cause storage control apparatus in 
a network of storage controller nodes having an owner storage controller 
node operable to control locking of a region of data during I/O activity, a 
messaging component operable to pass at least one message to request a 
lock, grant a lock, request release of a lock, and signal that a lock has 
been released, to perform the steps of: 

performing I/O on data owned by any said owner storage controller 
node, subject to compliance with lock protocols controlled by said owner 
storage controller node; 

maintaining Flash Copy of said region of data in a coherency relation 
with said region of data; and

caching a previous positive confirmation that said region of data has 
been left in a coherency relation with any said Flash Copy image, and 
performing I/O activity on the basis of said previous positive 
confirmation. 

ABSTRACT

HIGH-PERFORMANCE LOCK MANAGEMENT FOR FLASH COPY IN N-WAY
SHARED STORAGE SYSTEMS

Storage control apparatus is operable in a network of storage 
controller nodes having an owner node operable to control locking of a 
region of data during I/O activity, a messaging component operable to
request locks, grant locks, request release of locks and signal lock
release. The apparatus comprises an I/O performing component to perform 
I/O on data owned by any owner node, subject to compliance with lock 
protocols controlled by the owner node. Any Flash Copy image of the region 
of data is maintained in a coherency relation with the region of data, and 
the I/O performing component is operable to cache a previous positive
confirmation that the region of data has been left in a coherency relation 
with any said Flash Copy image, and to perform I/O activity on the basis of 
said previous positive confirmation. 